The development of job-specific tests (JSTs) for two 
occupations is discussed. A reading comprehension test and a 
mathematical reasoning test were developed for Customs Inspectors, 
and a reading comprehension test was developed for Social Security 
Claims workers. JST items incorporated reading samples or math 
problems from those found on the job. Each job-specific reading test 
contained 40 items, and the Customs math test contained 30 items. 
Panels of subject matter experts rated tasks and test items. 
Correlational and factor analyses that related the two reading tests 
and the math test to cognitive or non-cognitive marker tests showed 
that the JSTs were cognitive tests that measured traditional verbal 
and mathematical abilities. Studies of the Customs tests with about 
4,500 job applicants have confirmed the high reliabilities and 
generally good validities of the tests. The Claims worker test was 
not used operationally. Effect sizes for the Black (n=about 1,000) 
and Hispanic (n=about 1,000) Customs Inspector applicants were all 
close to one standard deviation with respect to the majority White 
group (n=about 2,500), which is typical of group differences 
associated with cognitive ability test scores. Research into 
alternative means of examining job applicants to reduce group 
differences indicated that combinations of interviews and tests, one 
of which should be a general cognitive ability test, can reduce group 
differences without losing test accuracy. Three tables present study 
data. (SLD) 

Job Specific Tests and an Overview of Research on Alternatives 

In this presentation, I will discuss something termed a "job specific" 
test and make some general remarks about the alternatives we studied. 
The device that we have called a "job specific" test is misnamed in that 
it is the minimum alternative; minimum in that, of the cognitively oriented 
alternative tests, its development involved the least replication of the 
job in the test. A reading comprehension test and a mathematical reasoning 
test were developed for Customs inspectors and a reading comprehension 
test was developed for Social Security claims workers. Customs inspectors 
do inspectional work in the enforcement of the Tariff Act and other laws 
governing the importation or exportation of merchandise. Claims workers 
adjudicate claims against the government by evaluating the legitimacy of 
an initial claim for retirement, disability, and/or health insurance 
benefits and by determining the amount of benefits to be paid initially 
and as the claim matures. 

Job specific test items were written incorporating samples of reading 
materials or math problems selected representatively from those found in 
the job. A sample math item might ask Customs inspector applicants to 
pick, from multiple choices, the correct amount of duty to collect on 20 
scarves worth $5.00 each when the specific duty rate is .16. To measure 
job-related reading skills, an applicant could be required to read a 
short paraphrased Customs or Social Security regulation and then pick the 
statement which is best supported by the paragraph. Table 1 in your handout 
shows examples of the kind of item which was developed for the Customs 
math and reading tests. The Social Security reading test was very similar 
in style to the Customs reading test. 



In the development of the Customs tests, two panels of Customs subject 
matter experts (SMEs) independently rated the learning and application 
of Customs laws and regulations and the collection of applicable duties 
and taxes as having "great importance" in Customs inspector work. To 
measure whether an applicant could perform these duties, a test of reading 
comprehension based on Customs-related laws and regulations and a test of 
mathematics reasoning based on the collection of duties and taxes were 
developed for the selection of Customs agents. 

To begin development of the Social Security test, fifty claims SMEs 
representing the various occupational series included in this type of social 
security work rated seven tasks relating to the learning and interpreting 
of social security rules and regulations as having high importance. The 
tasks were representative of the jobs found in the Claims area. A reading 
comprehension test based on randomly selected passages taken from social 
security rules and regulations manuals was developed. The process followed 
in the development of the Customs reading test included the following 
major steps: generating the essential reading list, determining the 
reading level of the job-related material, writing test items, and reviewing 
the test items. The source of the test items was a list of essential 
Customs inspector reading material that had been reviewed by a sample of 
entry-level inspectors and first line supervisors. The reading level for 
the job was calculated from the average scores for each book of reading 
materials (Payne, 1976). Then a panel of Customs inspectors was convened 
and given instructions on item writing by an OPM psychologist. The items 
were based on reading passages selected randomly from the essential 
reading materials. 



The process followed in the development of the Customs math test paralleled 
that of the Customs reading test: initially, a group of job-related 
math-oriented materials was culled out by a panel of six Customs inspector 
SMEs. The next step, the selection of math item types, did not have a 
reading test counterpart because math-related written material is readily 
converted to one particular reading test item type. The panel identified 
16 tasks which were appropriate for testing. The panel also determined 
that two formats would be used for the items in the test: one type, the 
word problem, would present the required information in a narrative form; 
the second type, the table problem, would implant the data used to solve 
the problem among other data in a table or schedule. 

The Claims reading test development began with the assembly of essential 
reading materials at job sites in three cities. A random sample of pages 
from these materials was selected for analysis of reading levels. Reading 
passages which fell within the average reading level for all the material 
were used as the basis for test items. 

Each of the job specific reading tests contained 40 items. These tests 
were relatively easy. In the research samples, the mean of the Claims 
test was 30 (of 40 items) and the Customs reading test mean was 28. The 
Customs math test, which had 30 items, was more difficult, with a mean of 
17. The reliabilities were all in the .80s. Correlational and factor 
analyses which related the two reading tests and the math test to the 
cognitive and non-cognitive marker tests show that the job specific 
tests are cognitive ability tests which measure the traditional verbal 
and mathematical abilities which are the primary components of classic 
cognitive ability tests. 

Concurrent criterion-related studies were carried out against training 
success and job performance. Training success was measured in Customs by 
classroom tests and in Social Security by ratings of training instructors. 
The performance rating measure duplicated in format the one used in 
studies of the other alternatives and it was used solely as a research 
instrument for which results were retained only in OPM files. Some of 
the dimensions which it measured varied with the occupations, but many of 
the dimensions were identical to those measured in the studies of the 
other alternatives. 

In general, the validity coefficients were typical of cognitive ability 
tests used for selection. The mean validity for all three tests against 
training criteria was .51 (corrected for unreliability); against job 
performance it was .37. 

The best estimates of expected group differences on these measures are 
based on applicant data. Unfortunately, these are available only for the 
Customs tests because a decision was made on administrative grounds not 
to use the Claims test operationally. There have been about 1000 
Hispanic and 1000 black applicants and about 2500 white applicants for 
Customs Inspector positions. The reliabilities of these tests are high 
and comparable and the sample sizes are relatively large so the estimates 
of group differences should be fairly stable. The effect sizes for the 
black and Hispanic groups are all close to one standard deviation with 
respect to the majority white group. These estimates are close to those 
observed with the MT&E and the job knowledge test and are equivalent to 
the difference cited by researchers as being typical of group differences 
associated with cognitive ability test scores. Thus, the data on job 
specific tests do not support the hypothesis that building content valid- 
ity into a cognitive test will reduce group differences. Validity, 
relative to cognitive ability tests in general, has been retained but so 
have the group differences. In sum, the job specific tests behaved as 
good cognitive tests should. 
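
The effect sizes discussed here are standardized mean differences: the majority-group mean minus the minority-group mean, divided by a measure of score variation. The Python sketch below is a minimal illustration only. It assumes the divisor is the pooled within-group standard deviation (the exact divisor used in these studies is not stated), the function name is hypothetical, and the input figures are the Tax Technician structured interview values from Table 3.

```python
import math

def effect_size(mean_major, sd_major, n_major, mean_minor, sd_minor, n_minor):
    """Standardized mean difference using a pooled within-group SD (an assumption)."""
    pooled_var = ((n_major - 1) * sd_major ** 2 + (n_minor - 1) * sd_minor ** 2) / (
        n_major + n_minor - 2
    )
    return (mean_major - mean_minor) / math.sqrt(pooled_var)

# Table 3, structured interview, Tax Technician:
# white group mean 2.72 (SD 1.36, N 306); black group mean 2.78 (SD 1.26, N 112)
d_tax_interview = effect_size(2.72, 1.36, 306, 2.78, 1.26, 112)
print(round(d_tax_interview, 2))  # -0.04
```

Other rows of Table 3 may rest on a slightly different divisor, so this should be read as an illustration of the definition rather than a reconstruction of the reported values.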

Initially I referred to the job specific test as the minimum alternative. 
In our studies, we wanted to see whether different forms of job 
specificity in test content and format could reduce group differences. The 
theory which led to this strategy is related to one of the five primary 
possible sources of test bias which Reynolds (1983) has outlined. 
Although the points he made were couched in an educational context, it is 
useful to consider them because they reveal how thin our theorizing is in 
this area. In paraphrase, they are (1) that the content of the tests is 
incompatible with the learning experiences of minorities, (2) that the 
standardization samples of the tests don't include enough minorities, (3) 
that the language of the test is culturally alien, (4) that tests measure 
different attributes for different groups, and (5) that tests predict 
important criterion components differently or not at all for minority members. 

Of these arguments, the last is the only one which is completely compatible 
with the consistent finding that differential validity is a chance phenomenon 
(Bartlett, Bobko, Mosier, & Hannan, 1978; Hunter, Schmidt, & Hunter, 
1979). That is, a test may be equally valid for the selection of members 
of all groups and still there may be the implication of unfairness in the 
selection process if one or more important criterion components are not 
predicted by the test and if these components may be predicted validly by 
another measure for which group differences are less. This reasoning 
leads, in its extreme form, to the Cosmic Search. The unreasonableness 
of the Cosmic Search comes about because it is difficult to find valid 
predictors of the job components which are not predicted by traditional 
cognitive tests and because we have no good theory of group differences 
in test scores, so we don't know what to look for. (To say that group 
differences are due to differences in a general cognitive factor has not, 
by itself, led to many testable hypotheses for designing alternative 
tests.) 

We took the approach that if we developed measures which were more job 
specific than a traditional cognitive ability test (that is, more like 
the job in content or format), we would be more likely to measure 
noncognitive components of the criterion or perhaps nontraditional 
cognitive components, and that these measures might be valid and have 
smaller group differences. 

Table 2 in the handout summarizes the results of the research studies we 
have been discussing. It shows the studies done for each procedure and 
summary and descriptive statistics for these studies. It is clear that 
the validities of these instruments are generally good, with the exception 
of the JCPS, and the E and E measures for which there was an inadequate 
data base. The validities for these measures are comparable to 
those reported for traditional cognitive ability tests. The descriptors 
(e.g., "good", "moderate") used to characterize the validities reflect 
both types of criteria and also reflect the level of corrections made to 
each statistic. This should be considered in making comparisons between 
procedures. 

Secondly, factor analyses indicate that the MT&E, the job specific tests, 
and the job knowledge tests load heavily on a general cognitive factor 
and that these are the tests which show the largest effect sizes and the 
highest validities. (Only black-white differences are considered in these 
analyses.) The structured interview has a slightly lower overall validity, 
loads much less on the general cognitive factor, and has considerably 
lower effect size. The JCPS has little or no validity and very small 
effect sizes. The structured interview performed very well and seems to 
offer the best opportunity for reducing group differences. Before 
deciding that selections should be made on the basis of the interview 
alone, it should be remembered that the supervisory ratings used as 
criteria were collected for this research only. They would be freer from 
error than the typical ratings. More importantly, the structured interview 
was extensively and carefully developed with behavioral benchmarks 
to aid the raters' judgments. There were at least two raters, trained 
with videotapes produced for these studies, rating each candidate. Thus 
it is probable that the ceiling of the validity of the usual structured 
interview is lower than was observed in these studies. 

If these conclusions concerning the structured interview are true, then 
the loss of validity by using it alone relative to a good cognitive 
ability test with a generalizable validity of over .50 would be considerable. 
An alternative is to use both a cognitive test and an interview. In 
order to estimate the validity and group differences when these instruments 
are used together for selection, an analysis was made of the MT&E and the 
structured interview as an equally weighted composite with a composite 
validity and effect size. This analysis parallels one suggested by 
Schmidt (1988). 

The basic data and results are shown in Table 3. The effect sizes of the 
two measures were estimated by cumulating across samples. Very small 
samples from some occupations were not included in the meta-analyses. 
The effect size of an equally weighted composite of the two instruments 
was estimated from the mean N-weighted cumulated effect size estimates. 
The validity of an equally weighted composite was estimated from the 
corrected estimates of the validities of the MT&E and the interview 
provided in the reports on these instruments. The results shown in Table 
3 indicate that, even after correction for the composite unreliability, 
the effect size is .83. This is a reduction from the one standard deviation 
difference which has been our basis for comparison. The composite 
validity is .61. This validity could be even higher if regression weights 
were used. One caveat is that there was an unknown amount of indirect 
restriction in range on the interview scores. Comparison of the vari- 
ances of the scores in the cumulated samples with other samples in which 
there should have been no restriction does not indicate that this should 
have been a problem. 
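
The composite figures can be approximated with the standard formulas for an equally weighted sum of two standardized predictors: the composite validity is (r1 + r2) / sqrt(2 + 2 * r12), and the composite effect size is (d1 + d2) / sqrt(2 + 2 * r12). The Python sketch below is an illustration under stated assumptions, not a record of the computation actually performed. It assumes the inputs are the job performance validities listed in Table 2 (.46 for the MT&E, .49 for the interview), the N-weighted effect sizes from Table 3 (1.01 and .24), and the corrected intercorrelation of .21. It omits the correction of the effect size for predictor unreliability, and the function name is hypothetical.

```python
import math

def equal_weight_composite(x1, x2, r12):
    """Validity or effect size of an equally weighted sum of two standardized predictors."""
    return (x1 + x2) / math.sqrt(2 + 2 * r12)

r12 = 0.21  # MT&E-interview correlation reported in Table 3

# Assumed inputs: job performance validities from Table 2
composite_validity = equal_weight_composite(0.46, 0.49, r12)

# Assumed inputs: N-weighted effect sizes from Table 3 (no unreliability correction)
composite_d = equal_weight_composite(1.01, 0.24, r12)

print(round(composite_validity, 2))  # 0.61
print(round(composite_d, 2))         # 0.8
```

Under these assumptions the composite validity agrees with the reported .61, and the uncorrected composite effect size of about .80 falls a little below the reported .83, which includes the correction for predictor unreliability.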

These results support a strategy of test development which seeks to 
optimize combinations of tests, one of which should be a general cognitive 
ability test. There is obviously much work that can be done. It is very 
promising, however, that there appears to be a psychometric methodology 
which can reduce group differences in selection rates without lowering 
the accuracy of our tests. 

This strategy does not relieve the test user of making utility decisions. 
The increased costs of administering alternative measures must be weighed 
against the probable decrease in adverse impact and increase in validity. 
The cost for the interview, for example, might be considerable. 

Table 1 

Examples of Job Specific Test Items 



Customs Math Item Example 

Sample Question 2: An importer has a shipment of 2,000 pens of equal value with a total value of 
$800.00. The duty rate on pens valued at 10¢ or more but not over 50¢ per pen is 
8% of their value; the duty on pens valued over 50¢ but not more than $1.00 per 
pen is 6% of their value. How much duty is paid on the shipment of pens? 

A) $ 48.00          D) $180.00 
B) $ 64.00          E) None of these 
C) $160.00 



Customs Reading Item Example 

Sample Question S. 

When Congress passes a law, it does not include within the law details about how the law is to be 
administered. Therefore, for each law Congress authorizes, the department or agency that administers 
the law issues such rules and regulations as are necessary for its enforcement. The rules and 
regulations are usually published in proposed form in the Federal Register for public comment. 

Select the statement that is best supported by the paragraph. 

A) Public comment on laws proposed by Congress are published in the Federal Register. 

B) The Federal Register must accept the rules and regulations that are published. 

C) Congress empowers the agency that administers a law to set forth rules and regulations. 

D) The legislative process may differ with different laws. 

E) Congress establishes guidelines for enforcing the laws it passes. 
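
One way to work the Customs math item in Table 1 is sketched below in Python. The sketch assumes the value thresholds in the item are 10 cents, 50 cents, and $1.00 per pen, and the function name is hypothetical rather than part of the test materials.

```python
from decimal import Decimal

def pen_duty(quantity, total_value):
    """Duty on a shipment of equally valued pens under the two rates quoted in the item."""
    unit_value = Decimal(total_value) / Decimal(quantity)
    if Decimal("0.10") <= unit_value <= Decimal("0.50"):
        rate = Decimal("0.08")  # 10 cents or more but not over 50 cents per pen
    elif Decimal("0.50") < unit_value <= Decimal("1.00"):
        rate = Decimal("0.06")  # over 50 cents but not more than $1.00 per pen
    else:
        raise ValueError("unit value falls outside the rates quoted in the item")
    return (Decimal(total_value) * rate).quantize(Decimal("0.01"))

# 2,000 pens with a total value of $800.00 are worth 40 cents each
shipment_duty = pen_duty(2000, "800.00")
print(shipment_duty)  # 64.00, which corresponds to choice B
```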

Table 2 

Summary of Alternatives Research 



Alternative procedure: MT&E 
  Occupation(s) and implementation: 
    Tax Technician, Fall 1984 
    Internal Revenue Officer, Spring 1986 
    Social Security Claims Authorizer and Claims Representative, Winter [year illegible] 
    Computer Specialist, Fall 1982 
  What it measures: Ability to learn the job related material required to perform 
    in an entry-level position and progress to the journey level 
  Validity: Good. Job Perf = .46 [2]; Training = .80 
  Total sample size: 826 (perf.); 8[?]7 (trng.) 
  Impact: Large* 
  Comments: For entry-level positions; better for more structured jobs; past use 
    for trades occupations 

Alternative procedure: JCPS 
  Occupation(s) and implementation: 
    Computer Specialist, Fall 1982 
    Tax Technician, Fall 1984 
  What it measures: Compatibility between an applicant's preferences and special 
    characteristics of job 
  Validity: Not useful. Job Perf = .03; Training = -.04 
  Total sample size: 344 (perf.); 594 (trng.) 
  Impact: None* 

Alternative procedure: Job Knowledge Test 
  Occupation(s) and implementation: 
    Contract Specialist, Spring 1986 
  What it measures: Knowledge of Contract Specialist work 
  Validity: Good. Job Perf = .38 [1]; Training = [illegible] [1] 
  Total sample size: 393 (perf.); 410 (trng.) 
  Impact: Large* 

Alternative procedure: Structured Interview 
  Occupation(s) and implementation: 
    Tax Technician, Fall 1984 
    Internal Revenue Officer, Spring 1986 
    SSA Claims Representative, Winter 1987 
    Customs Inspector, Fall 1986 
    Contract Specialist, Spring 1986 
  What it measures: Interpersonal "meet and deal" abilities 
  Validity: Moderate**. Job Perf = .49 [2]; Training = .38 
  Total sample size: 733 (perf.); 704 (trng.) 
  Impact: Small* 
  Comments: Mass screening of applicants difficult because of time and personnel 
    required to administer the interview 

Alternative procedure: E and E 
  Occupation(s) and implementation: 
    Computer Specialist, Fall 1982 
  What it measures: Applicant's ability and motivation to perform job predicted 
    from achievements and experiences 
  Validity: Undetermined [3]. Job Perf = .04; Training = .38 [2] 
  Total sample size: 162 (perf.); 218 (trng.) 
  Impact: Small* 
  Comments: Small samples make unstable estimates of validity coefficients; this 
    is our weakest database 

Alternative procedure: Job Specific Test 
  Occupation(s) and implementation: 
    Customs Inspector, Fall 1986 
    SSA Claims Representative, not used operationally by agency request 
  What it measures: Ability to understand job related math and reading materials 
  Validity: Job Perf = .37 [1]; Training = .31 [1] 
  Total sample size: 400 (perf.); 498 (trng.) 
  Impact: Large* 
  Comments: Easier to develop than traditional ability test but has equivalent 
    validity 

Note. No overall unfairness (under the Cleary model) against minorities noted for 
any of the selection procedures. 

* Adverse impact statistics based on the Uniform Guidelines (1978) 80% rule are 
unavailable because the small number of hires relative to the numbers of 
applicants makes these analyses unreliable. The statistic which is presented in 
this chart is effect size, which is the difference between the mean scores for the 
majority group (white) and a minority group (here black only) divided by a measure 
of the variation of the scores. Cohen (1970) indicates that effect sizes of less 
than [illegible] show no impact, effect sizes of .20 to .30 are small, .50 to .60 
are medium, and over .80 are large. 

** Validity is expressed for the interview as a ranking procedure. Operationally, 
it was used as a screen-out mechanism. Screen-out procedures, by definition, 
cannot be validated because there is no criterion data for those screened out. 

[1] Correlation derived from meta-analysis across samples and corrected for 
criterion unreliability. 

[2] Correlation derived from meta-analysis across samples and corrected for 
criterion unreliability and range restriction. 

[3] Validity coefficients for performance and training criteria are inconsistent, 
possibly because of small sample sizes. Overall validity not determined. 




Table 3 



Correlational and Effect Size Statistics for Estimating Composite 
Validity and Group Difference 



Effect Size Statistics 

Structured Interview 

                            Effect        Black Group              White Group 
Occupation                  Size (d)    Mean     SD      N       Mean     SD      N 
Tax Technician               -.04       2.78    1.26    112      2.72    1.36    306 
Internal Revenue Officer      .39       2.87    1.33    244      3.30    1.23    421 
Claims Representative         .15       3.30     .88     40      3.44     .95     63 
Contract Specialist           .20       3.33    1.06     83      3.55     .99    267 

N-weighted d                  .24 

MT&E 

                            Effect        Black Group              White Group 
Occupation                  Size (d)    Mean     SD      N       Mean     SD      N 
Computer Specialist          1.10      45.33   15.29   2041     61.53   12.22   7672 
Tax Technician               1.03      29.58    9.83   1210     40.92    9.06   19[?]3 
Internal Revenue Officer      .90      41.23    6.85   1291     47.72    5.30   2784 

N-weighted d                 1.01 



Mean weighted correlation of MT&E and Interview 
(corrected for unreliability) = .21 

Mean weighted reliability of Interview = .91 

Mean weighted reliability of MT&E = .92 

Validity of equally weighted composite of MT&E and 
Interview corrected for range restriction and unreliability = .61 

Effect size of equally weighted composite of MT&E and 
Interview corrected for predictor unreliability = .83 



